ABSTRACT The enterococci are Gram-positive lactic acid bacteria that inhabit the gastrointestinal tracts of diverse hosts.
However, Enterococcus faecium and E. faecalis have emerged as leading causes of multidrug-resistant hospital-acquired infections.
The mechanism by which a well-adapted commensal evolved into a hospital pathogen is poorly understood. In this study, we
examined high-quality draft genome data for evidence of key events in the evolution of the leading causes of enterococcal
infections, including E. faecalis, E. faecium, E. casseliflavus, and E. gallinarum. We characterized two clades within what is currently
classified as E. faecium and identified traits characteristic of each, including variation in operons for cell wall carbohydrate and
putative capsule biosynthesis. We examined the extent of recombination between the two E. faecium clades and identified two
strains with mosaic genomes. We determined the underlying genetics for the defining characteristics of the motile enterococci
E. casseliflavus and E. gallinarum. Further, we identified species-specific traits that could be used to advance the detection of
medically relevant enterococci and their identification to the species level.

IMPORTANCE The enterococci, in particular, vancomycin-resistant enterococci, have emerged as leading causes of multidrug- 
resistant hospital-acquired infections. In this study, we examined genome sequence data to define traits with the potential to 
influence host-microbe interactions and to identify sequences and biochemical functions that could form the basis for the rapid 
identification of enterococcal species or lineages of importance in clinical and environmental samples. 
The enterococci are a diverse group of Gram-positive gastrointestinal
(GI) tract colonizers with lifestyles ranging from intestinal
symbiont to environmental persister to multidrug-resistant
nosocomial pathogen (1, 2, 3). Enterococci are used in food production,
in probiotic products, and for tracking fecal contamination
and thus also are of regulatory and industrial interest. Most
enterococcal research has focused on the two species most associ- 
ated with human GI tract colonization and infection, Enterococ- 
cus faecium and Enterococcus faecalis (2, 3). Certain lineages, de- 
fined by multilocus sequence typing (MLST), are associated with 
hospital-acquired infections (e.g., E. faecium sequence type 17
[ST17], ST18, ST78, and ST203 and E. faecalis ST6, ST9, and ST40)
(4). Genome analysis has illuminated the extent of mobile content
(5) and evolution of antibiotic resistance (6) in E. faecalis ST6
strain V583 and the mobile element content and metabolic capa- 
bilities of E. faecium (7). Using genomic data we recently devel- 
oped for 28 enterococcal strains (8), we report and quantify diver- 
gence within what is currently classified as E. faecium and 
E. faecalis and identify the genetic bases for the defining characteristics
of the motile enterococcal species Enterococcus gallinarum
and Enterococcus casseliflavus. We identify loci homologous to
those known to direct the synthesis of extracellular polymers that 
interact with host surfaces, including a putative E. faecium capsule 
locus. We additionally identify genetic sequences and biochemical
functions that represent distinguishing features of potential value 
for the rapid identification of enterococci to the species level. 

RESULTS AND DISCUSSION 

Phylogenetic analysis of enterococci. We recently announced the 
public release of genome sequence data for 28 enterococcal strains 
of diverse origin (8) (see Table S1 in the supplemental material).
The 16 E. faecalis genomes sequenced represent the deepest nodes
in the MLST phylogeny, providing the greatest diversity. The 
strains include those of clinical, animal, and insect origins and 
were isolated from 1926 to 2005 (9). These strains represent ap- 
proximately 80 years of enterococcal evolution, spanning the pe- 
riods prior to and during widespread antibiotic use. Additionally, 
the genomes of 6 E. faecium, 1 E. gallinarum, and 3 E. casseliflavus 
clinical isolates from 2001 to 2005 and 2 human fecal E. faecium 
strains were examined. 
FIG 1 Core gene tree. Concatenated sequences of 847 genes core to 30 enterococci and the outgroup species L. lactis were aligned, and a phylogenetic tree was
generated using RAxML with bootstrapping. The bootstrap value for all nodes outside the E. faecalis clade is 100. E. faecium clades A (blue) and B (red) are
indicated.
OrthoMCL (10) was used to identify ortholog groups in the 30
enterococcal genomes. Ortholog groups represented in all 31 genomes
(the 30 enterococci and the outgroup) were considered core groups,
which were further subdivided into single-copy (1 gene copy in each
genome) and multicopy (>1 gene copy exists in at least 1 genome). Genes not
clustered were considered orphans. A phylogenetic tree generated 
from the concatenated sequences of 847 single-copy core genes is 
shown in Fig. 1. Relationships among the 18 E. faecalis strains,
despite their diverse origins, cannot be fully resolved by this analysis
(based on lack of bootstrap support for branches within the
E. faecalis branch; inset, Fig. 1). As expected, E. casseliflavus and E.
gallinarum branch separately, supporting their designation as dif- 
ferent species. Importantly, two clades were identified within the 
species E. faecium, as had been inferred by comparative genome 
hybridization, which suggested that hospital-associated isolates, 
including ST17 and ST18 isolates, may make up a distinct subspe- 
cies within E. faecium (11). The 3 vancomycin-resistant E. faecium 
strains in our collection are members of clade A, while the 2 human
fecal isolates are members of clade B (Fig. 1). To quantify the
relationships among these strains, we generated average nucleo- 
tide identity (ANI) plots (Fig. 2), which have been used to query 
and refine prokaryotic species definitions (12, 13). 
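
The ortholog group bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the OrthoMCL pipeline itself: the helper `classify_group`, the genome labels, and the group names are hypothetical, and the input is assumed to be a mapping from each ortholog group to its gene count per genome. Orphans (genes that were never clustered) do not appear in such a mapping and would be tallied separately.

```python
from typing import Dict, List


def classify_group(counts: Dict[str, int], genomes: List[str]) -> str:
    """Label one ortholog group from its gene count in each genome."""
    if any(counts.get(g, 0) == 0 for g in genomes):
        return "non-core"          # absent from at least one genome
    if all(counts[g] == 1 for g in genomes):
        return "core single-copy"  # exactly 1 gene copy in every genome
    return "core multicopy"        # >1 gene copy in at least 1 genome


# Hypothetical toy input: three genomes and three ortholog groups.
genomes = ["genome1", "genome2", "genome3"]
groups = {
    "og1": {"genome1": 1, "genome2": 1, "genome3": 1},
    "og2": {"genome1": 2, "genome2": 1, "genome3": 1},
    "og3": {"genome1": 1, "genome2": 1},
}
labels = {name: classify_group(c, genomes) for name, c in groups.items()}
```

Only groups labeled "core single-copy" would feed a concatenated alignment of the kind used for the core gene tree.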

E. faecium. The E. faecium ANI analysis refines phylogenetic 
relationships among clade A and clade B strains (Fig. 2). Within 
clade A, ST17 strain 410 and double-locus variants (DLVs) 933 
(ST18) and 502 (ST203) are closely related (99.2 to 99.4% ANI) 
whereas strains 501 (ST52) and 408 (ST582, an ST17 DLV) have 
lower ANI values with those strains, and each other (96.9 to 98.2% 
ANI). Similar ANI values were observed among clade B strains 
(97.9 to 99.4%). However, pairwise comparisons of clade A and 
clade B strains ranged from 93.9 to 95.6% ANI, overlapping an 
ANI species line of 94 to 95%. ANI values of 94 to 95% correlate 
with experimentally derived 70% DNA-DNA hybridization val- 
ues, a commonly accepted threshold for species designation (12, 
13, 14). Clade A and clade B may be endogenous to the GI tracts of
different hosts and now coexist among human flora as a result of 



mBio (mbio.asm.org), January/February 2012, Volume 3, Issue 1, e00318-11



FIG 2 ANI plot. Each point represents a pairwise comparison of two genomes.
Grey diamonds, E. faecalis-E. faecalis comparisons. Blue circles, clade A
E. faecium-clade A E. faecium comparisons. Red circles, clade B E. faecium-clade B
E. faecium comparisons. Yellow circles, clade A E. faecium-clade B E. faecium
comparisons. A species threshold of 94 to 95% ANI is indicated by the
green-shaded area.

antibiotic elimination of competitors, or clade A and clade B may 
be diverging from each other as a result of antibiotic use and 
ecological isolation (less likely because of the short time involved). 
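
To make the comparison against the ANI species line concrete, the following Python sketch places a pairwise ANI value relative to the 94 to 95% band. The function name `ani_call` and its three labels are illustrative assumptions; the band limits and the example values are those quoted in the text, and treating the band edges as inclusive is a simplification.

```python
def ani_call(ani_percent: float, low: float = 94.0, high: float = 95.0) -> str:
    """Place one pairwise ANI value relative to the 94 to 95% species line."""
    if ani_percent < low:
        return "below species line"
    if ani_percent > high:
        return "above species line"
    return "within species line"


# ANI values quoted in the text: the clade A versus clade B extremes
# (93.9 and 95.6) and the within-clade A extremes (96.9 and 99.4).
calls = {ani: ani_call(ani) for ani in (93.9, 95.6, 96.9, 99.4)}
```

Under this reading, the clade A versus clade B range straddles the band (one extreme below it and one above), whereas every within-clade value falls above it.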

For the 8 E. faecium strains in our collection, the two clades are
recapitulated using the 7 housekeeping genes selected for E. faecium
MLST (see Fig. S1 in the supplemental material). Between
clade A and clade B strains, the nucleotide identities of concatenated
MLST sequences range from 96.2 to 96.9% (compared to a
93.9 to 95.6% ANI range). To determine whether a single marker
is representative of either E. faecium clade, we examined the
distribution of individual MLST alleles among the E. faecium STs
assigned to clade A or clade B (see Fig. S1 in the supplemental
material). A "minority allelic population" composed of 5 divergent
STs was reported in seminal E. faecium MLST work (15). The
5 divergent STs (ST39, ST40, ST60, ST61, and ST62) identified by
that study belong to clade B (see Fig. S1 in the supplemental
material). The genomes of 7 additional E. faecium strains were
recently sequenced (7), and we used MLST to assign them to clade A
or B (see Fig. S1 in the supplemental material). The assignment of
one of these strains, E. faecium E980, to clade B is consistent with
previous analyses demonstrating the phylogenetic distance of this
strain from the other 6 (clade A) strains in that sequencing
collection (7). In the first-pass analysis, the allele adk-6, which differs
from the ST17 allele adk-1 at 3 synonymous sites, was observed to
occur almost exclusively in clade B strains (see Fig. S1 in the
supplemental material). To further explore the distribution of adk
alleles among E. faecium isolates, we extracted sequences of all 617
available STs in the E. faecium MLST database and determined the
extents of identity to an ST17 (clade A) reference. In the MLST
database, adk-1, adk-5, and adk-6 are the most abundant adk
alleles, representing 87% of the E. faecium STs. Of the 85 STs
possessing adk-6, 66 (78%) share 96 to 96.9% nucleotide identity with
ST17, comparable to that observed for clade B-ST17 comparisons.
Conversely, adk-1 and adk-5 occur primarily in STs with >99%
identity to ST17. These data suggest that adk allele exchange is
restricted, perhaps resulting from a barrier to DNA uptake such as
clustered regularly interspaced short palindromic repeats
(CRISPR)-cas defense and/or from the proximity of adk to the
replication origin (Fig. 3).
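
The identity comparisons above reduce to counting matching positions between aligned sequences. The Python sketch below shows that calculation on a hypothetical pair of short fragments and reproduces the 66-of-85 percentage quoted in the text. The function `percent_identity` is an illustrative helper that assumes the two sequences are already aligned to equal length; it is not the procedure used to query the MLST database.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent identical positions between two pre-aligned, equal-length sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical 10-base fragments that differ at one position.
toy_identity = percent_identity("ACGTACGTAC", "ACGTACGTAA")

# Share of adk-6 STs in the 96 to 96.9% identity band (counts from the text).
adk6_fraction = round(100 * 66 / 85)
```

For concatenated MLST loci, the same function would be applied to the joined alignment of all 7 housekeeping gene fragments.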

E. faecium 408 is a DLV of ST17 that possesses adk-6 and ddl-13
(see Fig. S1 in the supplemental material). Because adk-6 occurs
mostly among strains with lower relatedness to ST17, and ddl-13 is
present in two clade B strains (see Fig. S1 in the supplemental
material), we were curious about whether these alleles were
acquired by recombination. Genome mosaicism is evident in E. faecium
clade A strains 408 and 501 (Fig. 3). The occurrence of adk-6
and ddl-13 within a hybrid region in E. faecium 408 (Fig. 3; see data
set S1 in the supplemental material) supports the acquisition of
this region from a clade B strain. The putative genome defense
system EfmCRISPR1-cas (16), present in 2 of 3 clade B strains and
in E. faecium 408 (see Table S1 in the supplemental material),
occurs within this region, suggesting that CRISPR-cas was
acquired by E. faecium 408 from a clade B strain via recombination.
The hybrid region in E. faecium 501 includes pbp5 (Fig. 3; see data
set S1 in the supplemental material), which can confer ampicillin
resistance. Our results indicate that pbp5-S was acquired by E. faecium
501 from a clade B strain. The hybrid region in 501 is flanked
by a putative phage integrase (EFRG_00906) that is conserved
among all of the E. faecium strains in our collection (see data set S1
in the supplemental material). We recently reported an Hfr-like
mechanism for the transfer of chromosomal genes between
E. faecalis strains (17), and it seems likely that a similar mechanism
functions in E. faecium.

To determine whether specific traits define the two E. faecium 
clades, we searched for clade-specific ortholog groups present in 
and exclusive to all of the members of each clade. We then used 
representative gene sequences from each to search for similar sequences
in 7 additional E. faecium genomes (7) assigned to clade A
or B (see Fig. S1 in the supplemental material). Of the clade
A-specific genes (see data set S2 and Table S2 in the supplemental 
material), 8 are associated with a locus that has high sequence 
identity with and almost the same gene content as the ycjMNOPQRSTUV
locus of Escherichia coli, which is significantly enriched
in enteric clades (18) and also occurs in Listeria (19). The
organization of this locus is similar to that of a Lactobacillus aci- 
dophilus fructooligosaccharide (prebiotic) utilization locus (20). 
Of the genes unambiguously assigned to clade B (see data set S2 
and Table S2 in the supplemental material), 5 encode putative 
transcriptional regulators with protein domain hits to Mga or Rgg, 
regulators of virulence, competence, and cell-cell signaling in 
streptococci (21, 22). Two of these putative regulators are diver- 
gently transcribed from genes that are also clade B specific, includ- 
ing a putative thioredoxin that could modulate the redox state of 
cellular targets in response to oxidative stress (23). A putative 
phospholipase C is also clade B specific. Finally, one clade 
B-specific gene (EFSG_01746) was useful in identifying a genomic 
insertion, composed of 17 genes, in E. faecium T11 (see Table S2 in
the supplemental material). This region encodes a putative phos- 
photransferase system and a secreted hyaluronidase that could 
cleave the extracellular matrix of host cells. It is surprising that 
clade B (and not clade A, which contains all high-risk STs) strains 
encode a number of secreted factors that could interact with eu- 
karyotic cell surfaces. This suggests that clade B strains may be 
more closely associated with host tissues in the GI tract than clade 
A strains are, which possibly contributes to their persistence in the 
FIG 3 E. faecium genome mosaicism plot. The outermost ring shows E. faecium Com12 scaffolds, ordered by decreasing length clockwise from scaffold 1, with
each gene represented as a radial position along the ring. Each of the remaining 7 E. faecium genomes is represented by the rings below Com12. Genes are colored
by membership in clade A (blue) or clade B (red), as determined by individual gene trees built from ortholog groups. The strains shown, from the outermost to
the innermost rings, are Com12, 733, Com15, 501, 408, 502, 933, and 410. The locations of dnaA, Com12 MLST alleles, pbp5, and the EfmCRISPR1-cas locus are



GI tract, whereas clade A strains may be more transient and asso- 
ciated with the GI lumen, which contributes to their dissemina- 
tion. 
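
The search for clade-specific ortholog groups ("present in and exclusive to all of the members of each clade") can be expressed with set operations. The Python sketch below is a simplified illustration on hypothetical data: the strain labels, the group names, and the helper `clade_specific` are assumptions, and each strain is reduced to the set of ortholog groups it contains.

```python
from typing import Dict, List, Set


def clade_specific(groups: Dict[str, Set[str]], clade: List[str],
                   others: List[str]) -> Set[str]:
    """Ortholog groups found in every clade member and in no other strain."""
    in_all_members = set.intersection(*(groups[s] for s in clade))
    in_any_other = set().union(*(groups[s] for s in others))
    return in_all_members - in_any_other


# Hypothetical strains (a1, a2 in one clade; b1, b2 in the other).
groups = {
    "a1": {"og1", "og2", "og3"},
    "a2": {"og1", "og2", "og4"},
    "b1": {"og1", "og5"},
    "b2": {"og1", "og5", "og3"},
}
clade_a_only = clade_specific(groups, ["a1", "a2"], ["b1", "b2"])
clade_b_only = clade_specific(groups, ["b1", "b2"], ["a1", "a2"])
```

In this toy input, og3 is excluded from both results because it occurs in one member of each clade, which mirrors how a recombined or sporadically distributed gene would fail the "exclusive to all members" criterion.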

E. faecalis. In contrast to E. faecium, little phylogenetic diver- 
gence was observed among E. faecalis strains (Fig. 2). Among 306 
pairwise comparisons, ANI varies within a narrow range (97.8 to 
99.5%). Instead, shared gene content among these strains varies 
(70.9 to 96.5%). For example, strain T11 shares 96.5% of its 2,511
genes with ST6 strain V583, while V583 shares only 72.8% of its
3,265 genes with T11; they possess 99.5% ANI in the genes that
they share. The genome size of T11 is smaller than that of V583
(2.74 Mb versus 3.36 Mb) and is similar to that of the oral isolate
OG1RF (24), likely representing the minimal E. faecalis genome.
For all 18 E. faecalis strains, genome sizes vary between the extremes
of T11 and V583 (see Table S1 in the supplemental material).
We recently proposed that loss of CRISPR-cas in founders of
modern E. faecalis high-risk MLST lineages facilitated the influx of 
acquired antibiotic resistance genes and other mobile traits into 
these lineages (16). Genome size distribution significantly differs 
between strains possessing or lacking CRISPR-cas (P = 0.026; 
FIG 4 E. faecalis genome mosaicism plot. The outermost ring shows E. faecalis V583 chromosomal (scaffold 4) and plasmid scaffolds (scaffold 1, pTEF2; scaffold
2, pTEF3; scaffold 3, pTEF1), with each gene represented as a radial position along the ring. Each of the remaining 17 E. faecalis genomes is represented by the rings
below V583. Genes are colored by phylogenetic distance from E. faecalis V583 (from dark to light green with increasing phylogenetic distance), as determined by
individual gene trees built from ortholog groups. The strains shown, from the outermost to the innermost rings, are V583, T11, OG1RF, Merz96, T8, T2, D6, X98,
T3, T1, Fly1, CH188, HIP11704, ATCC 4200, E1Sol, ARO1/DG, DS5, and JH1. The locations of E. faecalis variable regions are shown (9). A, integrated plasmid;
B, prophage 1; C, E. faecalis pathogenicity island; D, prophage 2; E, prophage 3; F, putative island; G, prophage 4; H, prophage 5; I, putative island; J, vancomycin
resistance (vanB) transposon; K, integrated plasmid; L, prophage 6; M, prophage 7.



one-tailed Wilcoxon rank-sum test), with a greater average genome
size in strains lacking CRISPR-cas (3.1 Mbp versus
2.9 Mbp). The distribution of domain motifs associated with mobile
elements is significantly different in strains with genomes
>3 Mb in size (P < 0.05), including the plasmid mobilization
MobC domain (PF05713; P = 0.001), the antirestriction protein
ArdA (PF07275; P = 0.032), the replication initiation factor domain
(PF02486; P = 0.001), a plasmid addiction toxin domain
(TIGR02385; P = 0.032), and a transposase domain (PF01526;
P = 0.021). This supports a model where increased genome size is
the result of mobile element accretion, consistent with the proposition
that compromised genome defense facilitated the accretion
of mobile elements (16), resulting in larger genomes.
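
Shared gene content is direction dependent, which is why a small genome can share nearly all of its genes with a large one while the reverse comparison yields a much lower value. The Python sketch below illustrates this on hypothetical gene sets; the helper `shared_gene_percent` and the gene labels are assumptions, and real comparisons would be based on ortholog group membership. The last line counts ordered pairs among 18 genomes, which gives 18 x 17 = 306 and is consistent with the number of pairwise comparisons quoted above if each direction is counted separately.

```python
from itertools import permutations
from typing import Set


def shared_gene_percent(query: Set[str], reference: Set[str]) -> float:
    """Percentage of the query genome's genes that are also in the reference."""
    return 100.0 * len(query & reference) / len(query)


# Hypothetical gene sets: a small genome and a larger one containing most of it.
small = {"g1", "g2", "g3", "g4"}
large = {"g1", "g2", "g3", "g5", "g6", "g7"}
small_vs_large = shared_gene_percent(small, large)  # 3 of 4 genes
large_vs_small = shared_gene_percent(large, small)  # 3 of 6 genes

# Ordered (direction-aware) pairs among 18 genomes.
ordered_pairs = sum(1 for _ in permutations(range(18), 2))
```

The asymmetry between `small_vs_large` and `large_vs_small` is the same pattern reported for T11 and V583, where the additional genes of the larger genome have no counterpart in the smaller one.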

We analyzed the 18 E. faecalis genomes for mosaicism (Fig. 4).
Thirteen variable regions were previously defined for E. faecalis 
genomes using comparative genome hybridization to a V583- 
TABLE 1 ANI and shared-gene analyses of E. casseliflavus and E. gallinarum

         % ANI, % shared gene content^a
Strain   EC10      EC20      EC30      EG2       E. faecalis and E. faecium^b
EC10               98, 85    100, 99   74, 72    65-66, 51-55
EC20     98, 88              98, 88    74, 74    65-66, 52-56
EC30     100, 99   98, 86              74, 72    65-66, 51-55
EG2      74, 78    74, 77    74, 78              65-67, 55-60

^a The data shown are for genome 1 (left) compared to genome 2 (top). Values were
rounded to the closest whole number.
^b Ranges of values are shown.




based microarray (9). Regions of mosaicism were detected in 
strains Merz96, JH1, and T2 overlapping the E. faecalis pathogenicity
island (25) and two putative genomic islands containing
Tn916-like genes (5), respectively. These results are consistent
with conjugative acquisition of these islands and surrounding sequence
by Merz96, JH1, and T2 from strains closely related to
V583 (17). Collectively, much of the diversity of E. faecalis can be 
attributed to the accretion of mobile genetic elements on a largely 
conserved genomic backbone, with those mobile elements facili- 
tating recombinatorial exchange of chromosomally encoded 
traits. 

The motile enterococci. Very little is known about the ge- 
nomes of E. casseliflavus and E. gallinarum. Once thought to be 
associated primarily with vegetation (E. casseliflavus [26]) and
fowl (E. gallinarum [27]) and only rarely found in humans, these
species appear to be increasingly implicated in infections and hos- 
pital outbreaks (28, 29). Motility is a defining characteristic of 
most strains of E. casseliflavus and E. gallinarum, while E. casse- 
liflavus additionally produces a yellow pigment (30); however, 
there has been confusion because of phenotype variation (3). ANI 
analysis confirms that the E. casseliflavus and E. gallinarum strains 
in our collection, which possess ~74% ANI in shared genes, are
members of two separate species (Table 1). Motile enterococci are
reported to have <3 or 4 terminal or lateral flagella per cell (31). In
the E. casseliflavus and E. gallinarum genomes, we identified con- 
served gene clusters encoding proteins predicted to synthesize, 
export, and power a flagellum, as well as a chemotactic response 
system (see data set S3 in the supplemental material). Most of the 
proteins predicted to be encoded by the representative E. casseliflavus
EC10 motility gene cluster have best BLASTP hits to Lactobacillus
ruminis proteins (see data set S3 in the supplemental
material) (32). 

Bacterial motility is often regulated by the second messenger 
cyclic di-GMP (c-di-GMP), as are attachment to surfaces and pro- 
duction of extracellular polysaccharides (33). We identified puta- 
tive diguanylate cyclases possessing GGDEF domains (for c-di- 
GMP synthesis) and phosphodiesterases possessing EAL domains 
(for c-di-GMP turnover) in all 3 E. casseliflavus strains (see data 
set S3 in the supplemental material) but not in E. gallinarum. Two 
GGDEF proteins and one EAL protein are encoded 5' to a pre- 
dicted protein possessing glycosyltransferase, cellulose synthase, 
and PilZ domains (see data set S3 in the supplemental material). 
This protein shares high identity with a Clostridium difficile pro- 
tein thought to be regulated by c-di-GMP (CD2545 [34]) (see data 
set S3 in the supplemental material). PilZ domains bind c-di-GMP
(33), and it is possible that this domain regulates cellulose
synthesis or the production of another extracellular polymer in E.
casseliflavus.

E. casseliflavus produces a cell-associated carotenoid pigment 
thought to facilitate its environmental persistence by protecting 
against photooxidation (35). Staphylococcus aureus also produces a
carotenoid pigment and virulence factor, staphyloxanthin, that 
protects it from host-induced oxidative damage and antimicrobial 
peptides, and inhibitors show promise as novel therapeutics (36). 
Compared to S. aureus CrtOPQMN, E. gallinarum and E. casse- 
liflavus share CrtM and CrtN homologues that catalyze the first 
steps of staphyloxanthin biosynthesis (37). However, only E. cas- 
seliflavus possesses CrtO, CrtP, and CrtQ (see data set S3 in the 
supplemental material). Most ligand interaction sites (36) are 
conserved in the E. casseliflavus and E. gallinarum CrtM proteins 
(see Fig. S2 in the supplemental material), suggesting that CrtM 
inhibitors could be usefully applied to these bacteria. 

E. faecalis and E. faecium extracellular polysaccharides. Cell 
wall polymers produced by E. faecalis include a lipoteichoic acid 
(LTA) with a poly(glycerol-phosphate) backbone (38, 39); a putative
wall teichoic acid (WTA) composed of glycerol, glucose, and
phosphate (40); and a rhamnopolysaccharide (the enterococcal 
polysaccharide antigen or Epa) composed of rhamnose, 
N-acetylglucosamine, N-acetylgalactosamine, glucose, and galac- 
tose (40, 41, 42). The E. faecalis epa locus directs the synthesis of 
the Epa polymer (43), although the biochemical functions and 
essentiality of most Epa proteins are unknown. The production of 
an antiphagocytic capsule composed of galactose, glucose, and 
phosphate is strain variable in E. faecalis and dependent on the 
presence of the cps locus (9, 40). Other than the E. faecalis epa and 
cps loci and E. faecalis bgsA and bgsB, which are involved in LTA 
biosynthesis (44), the genetic bases of extracellular polymer bio- 
synthesis in enterococci are largely uncharacterized, as is the ge- 
netic basis of variable phagocytosis resistance in the species E. fae- 
cium (45). We therefore examined the distributions of epa, cps, 
and a predicted LTA biosynthesis pathway (46) and searched for 
new loci potentially important for decoration of the enterococcal 
cell surface. 

As expected based on previous work (42), the entire epa lo- 
cus — encompassing epaA to epaR — is core to the species E. faeca- 
lis. An epa locus varying in organization and content from that of 
E. faecalis is also core to E. faecium. In E. faecium, the genes are 
ordered epaABCDEFGH-epaPQ-epaLM-[an E. faecium-specific
gene]-epaOR. The intervening E. faecium-specific gene encodes a
protein with N-terminal similarity to E. faecalis EpaN but was 
not identified as being orthologous to epaN by OrthoMCL. Both 
the E. faecalis EpaN and E. faecium-specific proteins have a pre- 
dicted S-adenosylmethionine binding site in the N terminus 
but are divergent in C-terminal sequence. The E. faecium epa 
locus may direct the synthesis of a previously reported E. faecium 
tetraheteroglycan composed of galactose, rhamnose, N-acetyl- 
glucosamine, glucose, and phosphate (47). The conservation of 
most of the epa locus suggests that if Epa biosynthesis enzymes 
were targeted with novel antimicrobials, those antimicrobials 
could be effective against the enterococcal species of greatest con- 
cern to human health. Proteins predicted to be involved in 
E. faecalis V583 LTA biosynthesis (46) were additionally identified 
as being core to E. faecalis and E. faecium, as well as the other 
enterococcal species in our collection (see data set S4 in the sup- 
plemental material). 

Potentially important variation in E. faecalis and E. faecium epa 
operons occurs between orthologs of epaR (EF2177 in V583) and
EF2165 (see Fig. S3 in the supplemental material). Variation in 
this region was previously reported between E. faecalis strains 
V583 and OG1RF (42). The variable regions of the 26 E. faecalis
and E. faecium strains, which consist of 37 ortholog groups and 11
orphans (excluding transposases), encode predicted glycosyl- 
transferases and other proteins with likely roles in extracellular 
polysaccharide production (see data set S4 in the supplemental 
material). The 3 vancomycin-resistant E. faecium CC17 strains in
our collection possess a unique epa locus configuration with pu- 
tative sialic acid biosynthesis (neuABCD) genes within the variable 
region, and a divergently transcribed, predicted β-lactamase gene
inserted in the core epa region between epaO and epaR (see Fig. S3 
in the supplemental material). Sialic acid decoration by patho- 
genic bacteria is thought to be a form of molecular mimicry that 
interferes with detection by the host immune system (48). The 
neuABCD genes are not clade specific and are also present in clade 
B strain E980 and clade A strain U0317, suggesting that the epa 
region can be either lost or transferred between E. faecium clades. 
The potential for sialic acid decoration on high-risk, vancomycin- 
resistant strains has important implications for vaccine develop- 
ment. We additionally identified putative WTA biosynthesis 
genes (tagF/tarF and tagD/tarD [49]) in a subset of E. faecalis and
E. faecium epa variable regions (see Fig. S3 and data set S4 in the
supplemental material). The core epaA and epaLM genes encode
proteins similar to Bacillus subtilis TagO and TagGH (43), which
catalyze the initial step in WTA synthesis and the export of the 
assembled WTA polymer, respectively (49). 

Phagocytosis resistance in E. faecalis is associated with capsule 
production (40), which is a variable trait of that species (9). We 
examined the distribution of the E. faecalis cps capsule locus and 
found that it occurs only in E. faecalis, with little variation (see data 
set S4 in the supplemental material). We identified a novel 
capsule-like region in E. faecium (Fig. 5; see data set S4 in the 
supplemental material) that includes a phosphoregulatory system 
conserved among all species except E. faecalis (see data set S4 in the
supplemental material). The proteins encoded are similar to the
YwqCDE proteins of B. subtilis and the CpsBCD proteins of Streptococcus
pneumoniae (Table 2), protein-tyrosine kinase/dephosphorylase
regulatory systems that regulate UDP-glucose dehydrogenase
activity (50) and capsule production (51), respectively.
This system is located 5' to a variable cohort of putative extracel- 
lular polymer biosynthesis genes in E. faecium (Fig. 5; see data 
set S4 in the supplemental material). This genetic configuration is 
similar to that of S. pneumoniae cpsABCD, which is core to the 
capsule biosynthesis loci of 90 pneumococcal serotypes (52). In 
E. faecium, these genes are oriented cpsACDB (Table 2; see data
set S4 in the supplemental material). Because cps nomenclature is 
already used for the unrelated E. faecalis capsule biosynthesis locus 
(40), we refer here to the E. faecium cpsACDB genes by the alternate
S. pneumoniae gene names wzg, wzd, wze, and wzh, respectively (52).

TABLE 2 BLAST and Pfam analyses of putative phosphoregulatory ...

Representative   Pfam hit^b                S. pneumoniae TIGR4^c             B. subtilis 168^c
locus^a
EFPG_02020       None                      None                              None
EFPG_02021       LytR_cpsA_psr (2.7e-49)   SP_1942, transcriptional          Membrane-bound transcriptional
                                           regulator (9e-75, 65%, 95%)^d     regulator LytR (1e-83, 67%, 91%)
EFPG_02022       Wzz (1.1e-19)             Capsular polysaccharide           YwqC, modulator of YwqD
                                           biosynthesis protein ...          protein tyrosine ...
...              CbiA (1.8e-14)            Capsular polysaccharide           PtkA/YwqD, protein
                                           biosynthesis protein              tyrosine kinase
                                           Cps4D (9e-42, 56%, 87%)           (7e-74, 71%, 93%)
EFPG_02024       None                      Capsular polysaccharide           YwqE, protein tyrosine
                                           biosynthesis protein              phosphatase
                                           Cps4B (2e-25, 49%, 88%)           (5e-68, 61%, 100%)

^a From E. faecium 933.
^b Pfam hits with E values of ... are shown.
^c Values in parentheses are E value, % similarity, and % query coverage.
^d The second-best hit was capsular polysaccharide biosynthesis protein Cps4A (7e-31, 79%, 55%).

We subjected the variable region between wzh and 
EFVG_00414 in each E. faecium genome to BLASTP and 
conserved-domain analyses (see data set S4 in the supplemental 
material). In S. pneumoniae capsule production, a sugar trans- 
ferase (WchA) initiates capsule biosynthesis at the membrane by 
transferring an initial sugar to an undecaprenyl-phosphate car- 
rier; additional sugars are then transferred to the repeat unit by 
glycosyltransferases, the structure is flipped across the membrane 
by the Wzx flippase, and additional repeat units are added by the 
Wzy polymerase (52). WchA-like proteins were identified in each 
of the 8 E. faecium variable regions, as were many predicted gly- 
cosyltransferases and enzymes likely to generate activated sugar 
moieties for transfer (Fig. 5). Wzx and Wzy homologues were 
identified in some, but not all, E. faecium strains (Fig. 5). Phago- 
cytosis resistance is variable among E. faecium isolates (45), and no 
mechanism has been reported for this clinically relevant pheno- 
type. It is likely that the putative capsule locus and/or variable epa 
loci described here contribute. We additionally identified putative 
wzg, wzd, wze, and wzh sequences in the enterococcal species 
E. saccharolyticus and E. italicus (data not shown), suggesting that, 
at least among the sequenced enterococci, E. faecalis is the excep- 
tion, lacking this capsule biosynthesis pathway. 

Species-specific signatures. We used a combination of data, 
including Biolog carbon substrate catabolism analysis of a subset 
of our strains (see Materials and Methods), ortholog groups, and
the Comparative Metabolism tool within a computationally generated
database of predicted metabolic pathways (EnteroCyc;
http://enterocyc.broadinstitute.org), to identify species-specific 
biochemical traits and nucleotide sequences that could augment 
existing methodologies to classify enterococcal isolates. For Biolog 
analysis, we focused on carbon substrates having the strongest 
species-specific signatures (Table 3). 
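One way the ortholog-group part of this screen could be sketched is shown below. The text does not state the exact criterion, so this sketch assumes that a group counts as species specific when it is present in every genome of the target species and in no genome of any other species. The function name, the input layout, and the toy group identifiers are all hypothetical.

```python
from typing import Dict, List, Set


def species_specific_groups(
    group_members: Dict[str, Set[str]],
    genome_species: Dict[str, str],
    target_species: str,
) -> List[str]:
    """Return ortholog groups found in all genomes of one species and nowhere else.

    group_members: ortholog group ID -> names of genomes with at least one gene in it.
    genome_species: genome name -> species label.
    Assumption (not from the paper): "species specific" means present in every
    target-species genome and absent from every other genome.
    """
    target_genomes = {
        genome for genome, species in genome_species.items()
        if species == target_species
    }
    if not target_genomes:
        return []
    return sorted(
        group for group, genomes in group_members.items()
        if genomes == target_genomes
    )


# Toy example with hypothetical group IDs.
genome_species = {
    "V583": "E. faecalis",
    "T8": "E. faecalis",
    "Com12": "E. faecium",
    "EC10": "E. casseliflavus",
}
group_members = {
    "OG_a": {"V583", "T8"},           # all E. faecalis genomes, no others
    "OG_b": {"V583", "T8", "Com12"},  # shared with E. faecium
    "OG_c": {"V583"},                 # missing from one E. faecalis genome
    "OG_d": {"EC10"},                 # only E. casseliflavus genome
}
faecalis_specific = species_specific_groups(group_members, genome_species, "E. faecalis")
```

A looser criterion (for example, presence in most rather than all genomes of a species) would better match traits described as "enriched" rather than specific, but it would need a threshold the text does not give.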

Inulin fermentation was reported to be a distinguishing char- 
acteristic of motile enterococci (31), and our Biolog analysis con- 
firmed that inulin metabolism is restricted to E. casseliflavus. Ad- 
ditionally, genes for acetoin dehydrogenase (ECAG_02019 to 
ECAG_02022), which converts acetoin to acetaldehyde and acetyl 
coenzyme A, are unique to E. casseliflavus. Catabolism of 
α-ketovalerate is specific to E. faecalis, as are genes (bkdDABC; 
EF1661 to EF1658) encoding a previously characterized 
branched-chain α-keto acid dehydrogenase complex (53). The 
eutBC genes (EF1629 and EF1627, respectively) directing ethanol- 
amine catabolism and the formate dehydrogenase gene fdhA 
(EF1390) are also E. faecalis specific. Catabolism of the cyclic oli- 
gosaccharide γ-cyclodextrin, an additive in pharmaceuticals and 
other products (54), is enriched in E. faecium, while a gene for 
glutaminase (EFTG_00235), which converts glutamine to gluta- 
mate and ammonia, is E. faecium specific. Probes targeting c-di- 
GMP signaling (see data set S3 in the supplemental material) and 
acetoin dehydrogenase genes (E. casseliflavus); the eutBC, fdhA, 
and bkdDABC genes (E. faecalis); and the glutaminase gene (E. fae- 
cium) could be used to discriminate these different enterococcal 
species. We did not detect E. faecium clade-specific metabolism 
using Biolog analysis or EnteroCyc predictions; however, clade- 
specific gene sequences (see Table S2 in the supplemental mate- 
rial) could be used as molecular probes. 

TABLE 3 Biolog carbon catabolic substrates with the strongest species-specific signatures (OD590 ratios^a) 

Species: E. casseliflavus; E. faecium; E. faecalis 

Chemical / strains: EC10 EC20 EC30; 408 410 933 Com12 Com15 733; V583 T8 T1 X98 E1Sol AR01/D.G. Fly1 T3 

α-Ketovalerate 4.6,3.5 3.8,3.4 3.7,4.5 3.8,3.7 3.5,2.7 5.4,5.6 3.6,3.6 3.8,3.8 
γ-Cyclodextrin 7.0,7.2 7.4,6.2 4.4,3.4 7.5,6.8 4.7,5.3 6.7,6.8 2.4,2.6 
Inulin 5.2,2.8 3.6,4.0 4.2,4.6 

^a Shown are ratios of the OD590 in a carbon-containing well to the OD590 of a no-carbon-added control well after 48 h incubation at 37°C. A ratio >2 was considered a positive 
result. Each strain was tested twice, and the data shown are for both trials. Ratios of <2 are not shown. All ratios for E. gallinarum were <2. 

Perspectives. A comparative genomic approach was used to 
address gaps in our knowledge of Enterococcus, a bacterial genus of 
importance to human health. Our phylogenetic analysis of E. fae- 
cium reveals and quantifies the distance that separates two distinct 
phylogenetic clades between which gene exchange has occurred. 
E. faecium clade-specific genes (see data set S2 and Table S2 in the 
supplemental material) are suggestive of different niches for clade 
A and clade B E. faecium in the GI tract. Additionally, conserved 
and variable pathways that appear to be important for cell wall 
polymer biosynthesis were identified. In contrast to E. faecium, a 
multiclade structure was not observed in E. faecalis, for which the 
acquisition of mobile elements appears to be a major source of 
genome diversity. Antibiotic resistance and pathogenicity island 
traits have converged in E. faecalis lineages (9), represented by 
strains V583, T8, and CH188. Despite the convergence of similar 
traits in those lineages and similar genome sizes (>3 Mb), sub- 
stantial differences in gene content exist. Ecotypes defined by spe- 
cific mobile element cohorts may be identified within high-risk 
lineages or in lineages with variable CRISPR-cas status (e.g., ST40 
and ST21 [16]). Finally, comparative genomics highlighted fun- 
damental differences between E. casseliflavus and E. gallinarum. 
The importance of the occurrence of motility operons in both but 
of genes related to the formation and function of the c-di-GMP 
second messenger only in E. casseliflavus and the impact of motil- 
ity on metabolism represent interesting areas for future explora- 
tion. 

MATERIALS AND METHODS 

Enterococcal strains and genome sequencing. E. faecalis strains were se- 
lected for genome sequencing to represent the diversity of a collection of 
106 isolates previously characterized (9). The E. faecalis V583 and OG1RF 
genome sequences were previously reported (5, 24). The E. casseliflavus, E. 
gallinarum, and 6 E. faecium strains were obtained from a repository of 
clinical isolates (Eurofins Medinet). E. faecium Com12 and Com15 were 
isolated from feces of healthy human volunteers under Schepens Eye Re- 
search Institute Institutional Review Board protocol 2006-02, Identifica- 
tion of Pathogenic Lineages of E. faecalis. E. faecium STs were previously 
determined (16, 55), and E. faecium MLST data were accessed at 
http://efaecium.mlst.net. The sequencing, assembly, annotation, and rapid 
public release of these genome sequences have been previously described 
(8). 

Standard analyses, OrthoMCL, and EnteroCyc. Orthologous gene 
groups were identified using OrthoMCL (10), with an all-versus-all 
BLAST cutoff of 1E^^. Lactococcus lactis subsp. cremoris SK11 plasmid 
(NC_008503 to NC_008507) and chromosomal (NC_008527) genes were 
included as the outgroup. Coding sequences were aligned using Muscle 
(56), and poorly conserved regions were trimmed using trimAl (57). All 
trimmed alignments were concatenated and used to estimate phylogeny 
using maximum likelihood and 1,000 bootstrap trials as implemented by 
RAxML (58) using the rapid-bootstrapping option and the GTRMIX 
model. Conserved protein domains were predicted using HMMER3 (59) 
to search the Pfam (release 24; http://pfam.janelia.org) (60) and TIGRfam 
(release 10) (61) databases. The statistical significance of differences in 
genome size and conserved protein domain distribution was assessed us- 
ing the one-tailed Wilcoxon rank sum test. Membrane helix predictions 
were generated with transmembrane protein topology with a hidden 
Markov model (14). Protein subcellular localization predictions were 
generated using PsortB (62). Sequence alignments and phylogenetic trees 
in the figures in the supplemental material were generated with ClustalW 
in MacVector. Enzyme Commission (EC) numbers for the proteins in 
EnteroCyc (http://enterocyc.broadinstitute.org/) were predicted using 
gene coding sequences (CDS) and BLASTX to search the KEGG database 
(release 56) (63) and assigning EC numbers based on the KEGG annota- 
tion. Only significant hits with an E value of <1E^^'' and 70% overlap 
were considered. Pathways, operons, transporters, and pathway holes 
were predicted using the Pathway Tools software suite (64, 65). Unless 
otherwise noted, BLASTP and nucleotide megaBLAST queries were exe- 
cuted against the NCBI nonredundant protein sequence, nucleotide col- 
lection, and whole-genome shotgun read databases using NCBI BLAST. 
Proteins encoded by the E. casseliflavus EC10 motility locus were com- 
pared to a B. subtilis 168 reference using BLASTP (see data set S3 in the 
supplemental material); the B. subtilis 168 flagellum is a reference Gram- 
positive flagellum in the KEGG database 
(http://www.genome.jp/kegg-bin/show_pathway?bsu02040). 

ANI and shared-gene analyses. OrthoMCL ortholog groups were 
used to determine shared gene contents in pairwise genome comparisons. 
For a genome pair (genome 1 and genome 2), the total number of genes in 
genome 1 was determined and the number of genes in genome 1 shared 
with genome 2 (based on shared ortholog group membership) was deter- 
mined. Percent shared gene content was calculated by dividing the num- 
ber of genome 1 genes shared with genome 2 by the number of genes in 
genome 1. Nucleotide alignments of shared genes were used to determine 
the numbers of identical and different nucleotide residues in shared genes. 
For comparisons within species, at least 2,113 gene sequences were uti- 
lized. Percent ANI was calculated by dividing the number of identical 
nucleotide residues in shared genes by the total number of nucleotide 
residues. 
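The two ratios described above can be written out as a minimal sketch. The function names and input layout are hypothetical, and two details are assumptions not stated in the text: genes without an ortholog group are counted as unshared, and alignment columns containing a gap are skipped when counting residues.

```python
from typing import Dict, Iterable, Optional, Tuple


def percent_shared_gene_content(
    genome1: Dict[str, Optional[str]],
    genome2: Dict[str, Optional[str]],
) -> float:
    """Percent of genome 1 genes that share an ortholog group with genome 2.

    Each dict maps gene ID -> ortholog group ID (None if the gene has no group;
    treating such genes as unshared is an assumption of this sketch).
    The denominator is the total number of genes in genome 1.
    """
    groups_in_genome2 = {g for g in genome2.values() if g is not None}
    shared = sum(1 for g in genome1.values() if g in groups_in_genome2)
    return 100.0 * shared / len(genome1)


def percent_ani(aligned_pairs: Iterable[Tuple[str, str]]) -> float:
    """Percent identical nucleotides over pairwise alignments of shared genes.

    aligned_pairs holds (sequence from genome 1, sequence from genome 2),
    already aligned to equal length. Gap columns are skipped (assumption).
    """
    identical = 0
    compared = 0
    for seq1, seq2 in aligned_pairs:
        for base1, base2 in zip(seq1.upper(), seq2.upper()):
            if base1 == "-" or base2 == "-":
                continue
            compared += 1
            if base1 == base2:
                identical += 1
    return 100.0 * identical / compared


# Toy data with hypothetical gene and group IDs.
g1 = {"a1": "OG1", "a2": "OG2", "a3": "OG3", "a4": None}
g2 = {"b1": "OG1", "b2": "OG3"}
shared_1_vs_2 = percent_shared_gene_content(g1, g2)
shared_2_vs_1 = percent_shared_gene_content(g2, g1)
ani = percent_ani([("ACGT", "ACGA"), ("AC-T", "ACGT")])
```

Because the denominator is the gene count of genome 1, percent shared gene content is not symmetric: comparing genome 1 with genome 2 can give a different value than the reverse comparison.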

Recombination analysis. See Text S1 in the supplemental material 
for a description of the methods used for genome mosaicism analysis and 
plot generation. 

Biolog analysis. A subset of strains (8/18 E. faecalis, 6/8 E. faecium, 3/3 
E. casseliflavus, and 1/1 E. gallinarum) representing the diversity of the 
collection was analyzed in duplicate by Biolog Phenotype microarrays in 
accordance with the manufacturer's instructions. Optical density at 590 
nm (OD590) was read using a Synergy 2 microplate reader (Bio-Tek). The 
48-h OD590 reading of each well containing a carbon source was divided 
by the OD590 value obtained for the negative-control well. A ratio that 
reproducibly exceeded 2× the background was considered to be a 
positive result. 
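The scoring rule above can be sketched as follows. The function names are hypothetical, and "reproducible" is interpreted here as every replicate trial exceeding the cutoff, with the strict ">2" comparison taken from the Table 3 footnote; both readings are assumptions of this sketch.

```python
from typing import Sequence

POSITIVE_RATIO = 2.0  # cutoff relative to the no-carbon control well


def od_ratio(od_carbon_well: float, od_control_well: float) -> float:
    """Ratio of the 48-h OD590 in a carbon-source well to the control well."""
    return od_carbon_well / od_control_well


def is_positive(trial_ratios: Sequence[float],
                threshold: float = POSITIVE_RATIO) -> bool:
    """Call a substrate positive only if every replicate ratio exceeds the cutoff.

    Assumption: "reproducible" means all trials pass, not just one of them.
    """
    return len(trial_ratios) > 0 and all(r > threshold for r in trial_ratios)


# Hypothetical raw readings for one well, then a ratio pair reported in Table 3.
example_ratio = od_ratio(0.52, 0.10)
inulin_call = is_positive([5.2, 2.8])
```

Under this rule, a strain with one trial above and one trial below the cutoff would be scored negative, which is consistent with ratios below 2 being omitted from Table 3.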